Connectionist Computing - Final Project

Student: Finola Cahill

Model Definition

XOR problem

Initialising and exploring the data

The data is not linearly separable
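This can be checked directly. Below is a small sketch (not part of the project code) that brute-forces a grid of linear decision boundaries w1*x1 + w2*x2 + b > 0 and confirms that none of them classifies all four XOR patterns correctly:

```python
import itertools
import numpy as np

# The four XOR patterns and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Search a grid of linear boundaries w1*x1 + w2*x2 + b > 0:
# no setting classifies all four points correctly.
grid = np.linspace(-2, 2, 41)
separable = any(
    np.array_equal((X @ np.array([w1, w2]) + b > 0).astype(int), y)
    for w1, w2, b in itertools.product(grid, grid, grid)
)
print(separable)  # False: no linear boundary solves XOR
```

A single-layer network computes exactly such a linear boundary, which is why at least one hidden layer is needed here.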

Initial test of the network

XOR-test-1

We can see that, once their outputs are rounded, the Sigmoid- and Tanh-activated networks correctly predict the target values.

Both the network using the Sigmoid function, and the network using Tanh have succeeded in learning XOR.
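The project's own model code is defined earlier; as a self-contained illustration of what such a network looks like, here is a minimal sketch of a 2-4-1 network (tanh hidden layer, sigmoid output, batch gradient descent on MSE) learning XOR. The architecture, learning rate, and epoch count are illustrative choices, not the report's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# 2-4-1 network: tanh hidden layer, sigmoid output, MSE loss.
W1 = rng.normal(0, 0.5, (2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0, 0.5, (4, 1)); b2 = np.zeros(1)
lr = 0.1

for _ in range(10000):
    h = np.tanh(X @ W1 + b1)                 # hidden activations
    out = 1 / (1 + np.exp(-(h @ W2 + b2)))   # sigmoid output
    # Backpropagate the squared-error gradient through both layers.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * (1 - h ** 2)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(0)

print(np.round(out).ravel())  # learned XOR when this matches [0, 1, 1, 0]
```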

Testing effect of number of hidden units

XOR-test-2

Sigmoid is more sensitive to changes in the number of hidden units than Tanh or the "linear" activation.

Testing effect of varying learning rate

XOR-test-3

Both tanh and sigmoid prefer a larger learning rate for this problem.

Effect of varying initialization

XOR-test-4

Tanh is relatively insensitive to the weight initialisation strategy, as is the "linear" activation (which is not learning successfully). The sigmoid function responds much better to Xavier initialisation.
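For reference, Xavier (Glorot) uniform initialisation draws weights from U(−b, b) with b = sqrt(6 / (fan_in + fan_out)), which gives a weight variance of 2 / (fan_in + fan_out). A small sketch (not the project's implementation):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=None):
    # Glorot/Xavier uniform: the bound is chosen so the weight variance
    # is 2 / (fan_in + fan_out), keeping activation scale roughly
    # constant from layer to layer.
    if rng is None:
        rng = np.random.default_rng(0)
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

W = xavier_uniform(500, 500)
# Empirical std should be close to sqrt(2 / (500 + 500)) ~= 0.0447.
print(W.shape, round(float(W.std()), 3))
```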

Effect of varying bias initialization

XOR-test-5

Again, tanh and the linear function are unaffected by the variations. The sigmoid function again proves sensitive to these modifications, with a bias of one leading to slower convergence.

Sin

Instantiating and exploring the data
Initial test of the network

sin-test-1

Tanh has by far the lowest error scores. The error on the training set is quite a bit lower than the error on the test set, so the network may be overfitting slightly.

Due to the limited range of the sigmoid function, it is unable to learn this output correctly. Tanh is training very well, and the linear function is approximating the data to a reasonable level.
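The range argument can be verified directly: sigmoid outputs lie strictly in (0, 1), so a sigmoid output unit can never produce the negative half of sin, whereas tanh spans (−1, 1), matching sin's range. A quick check:

```python
import numpy as np

x = np.linspace(-10, 10, 1001)
sig = 1 / (1 + np.exp(-x))
# Sigmoid is bounded to (0, 1), so it cannot reach sin's negative
# values; tanh is bounded to (-1, 1), which matches sin's range.
print(sig.min() > 0, sig.max() < 1)  # True True
print(np.tanh(x).min() < 0)          # True
```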

Testing effect of number of hidden units

sin-test-2

Again, Tanh has by far the lowest error scores, with 0.365 being the smallest error we have seen on the test set.

The number of hidden units seems to have little effect on the overall final error of the networks; the boxplots are almost non-existent given the lack of range in the results. The only activation showing a small amount of reactivity is tanh, which has its lowest error scores with 2 and 11 hidden units.

The sigmoid function again proves more reactive than the alternatives.

Testing effect of varying learning rate

sin-test-3

The above values are SSE values, so the smaller the number, the better. The linear function did not tolerate a larger learning rate, and failed to learn due to exploding loss/gradients until the learning rate fell below 0.05. Tanh performs best with a learning rate of around 0.01. Sigmoid is failing to learn.

The loss curve of the tanh network is very unstable with a larger learning rate.

Test effect of varying weight initialisation

sin-test-4

Letter Recognition

Loading and exploring the data

The dataset is relatively balanced
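One simple way to quantify "relatively balanced" is the ratio between the most and least frequent class. A hedged sketch (the `class_balance` helper is illustrative, and the toy labels stand in for the 26 letter classes of the real dataset):

```python
from collections import Counter

def class_balance(labels):
    # Ratio of the most to least frequent class; 1.0 is perfectly balanced.
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Toy example; the real check would run on the 26 letter labels.
print(class_balance(list("AABBC" * 20)))  # 2.0: A and B are twice as common as C
```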

Initial test of the network

let-test1

Sigmoid is outperforming Tanh significantly in terms of accuracy. The linear network is not learning this problem.

The sigmoid-activated network is outperforming the tanh equivalent. The linear network has a loss of zero, not due to successful training, but due to instability in training and exploding loss issues.

Testing effect of number of hidden units

let-test2

Although all the networks have similar minimum accuracy values, the sigmoid activated network is globally outperforming tanh, whatever the number of hidden units.

For both of the above graphs, we can see that the lower loss levels correspond to a higher number of hidden units. Fewer hidden units lead to a flatter, higher loss curve.

Testing effect of varying learning rate

let-test3

The inter-quartile ranges, minimums and maximums are quite similar between the tanh- and sigmoid-activated networks. However, the median values for sigmoid are significantly higher.

The higher learning rates are "bouncing" quite a bit, and never getting down closer to zero.

Here, I have "zoomed in" on the smaller learning rates. We can see that the smallest learning rates are quite smooth, but are not getting as low as the slightly larger learning rates. The smallest learning rates might get there eventually, but it would take significantly more time.

Just as we saw with the Sigmoid-activated network, the largest learning rates are unsuccessful and oscillating quite a bit.

As above, the smallest rates are "smoother" but are not getting quite as low as the slightly larger learning rates.

Effect of varying initialization

let-test4

Xavier and He initialisation perform fairly similarly. Random initialisation gives fair results, and zero initialisation prevents the networks from learning.
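The failure with zero initialisation is the classic symmetry problem: with all weights zero, every hidden unit computes the same output and receives the same gradient, so the units never differentiate. A one-step sketch (illustrative 2-3-1 network, not the project's code):

```python
import numpy as np

X = np.array([[1.0, 2.0]])
y = np.array([[1.0]])

# Zero-initialised 2-3-1 network with sigmoid hidden units.
W1 = np.zeros((2, 3)); W2 = np.zeros((3, 1))
h = 1 / (1 + np.exp(-(X @ W1)))         # every hidden unit outputs 0.5
d_out = (h @ W2) - y                    # output error
d_h = (d_out @ W2.T) * h * (1 - h)      # zero: W2 is all zeros
W2 -= 0.1 * h.T @ d_out                 # identical update for every row
W1 -= 0.1 * X.T @ d_h                   # no update at all

# The hidden units remain interchangeable: symmetry never breaks.
print(np.allclose(W1, 0.0), np.ptp(W2) == 0.0)
```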

Effect of learning rate decay

let-test5

Decay is having a positive effect on accuracy scores.

We can see the effect of decay in the bottom half of the curve, where more drastic decay leads to an almost straight line.
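The exact schedule used here is not shown; a common choice, sketched below, is exponential decay, lr_t = lr0 · decay^t. With an aggressive decay factor the step size shrinks quickly, which is consistent with the flattened tail of the curve:

```python
# Exponential learning-rate decay: lr_t = lr0 * decay**epoch.
# More aggressive decay flattens the tail of the loss curve, since the
# steps soon become too small to change the weights much.
def decayed_lr(lr0, decay, epoch):
    return lr0 * decay ** epoch

schedule = [decayed_lr(0.5, 0.95, e) for e in range(100)]
print(round(schedule[0], 3), round(schedule[-1], 4))  # 0.5 shrinking toward 0
```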

Final accuracy on the training set is 0.8863, and final accuracy on the test set is 0.8706.

Final accuracy on the training set is 0.8414, and on the test set 0.8244.

The loss curve without learning-rate decay oscillates wildly.